Systems and methods for missing data imputation

ABSTRACT

Congestive heart failure (CHF) is a leading cause of death in the United States. WANDA is a wireless health project that leverages sensor technology and wireless communication to monitor the health status of patients with CHF. The first pilot study of WANDA showed the system’s effectiveness for patients with CHF. However, WANDA experienced a considerable amount of missing data due to system misuse, nonuse, and failure. Missing data is highly undesirable as automated alarms may fail to notify healthcare professionals of potentially dangerous patient conditions. Embodiments of the present disclosure may utilize machine learning techniques including projection adjustment by contribution estimation regression (PACE), Bayesian methods, and voting feature interval (VFI) algorithms to predict both non-binomial and binomial data. The experimental results show that the aforementioned algorithms are superior to other methods with high accuracy and recall.

CROSS-REFERENCE TO RELATED APPLICATIONS

The present application is a continuation of United States Patent Application No. 14/241,431 filed Feb. 26, 2014, now United States Patent No. 11,450,413, which application is a 371 United States national phase application of PCT/US2012/052544, filed Aug. 27, 2012, which claims priority to U.S. Provisional Patent Application No. 61/528,065 filed Aug. 26, 2011, the contents of which are incorporated herein by reference in their entirety.

STATEMENT OF GOVERNMENT SPONSORED RESEARCH

This invention was made with Government support under Grant No. LM007356, awarded by the National Institutes of Health. The Government has certain rights in this invention.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 illustrates a block diagram of a system 100 for collecting and imputing patient health data.

FIG. 2 illustrates one example of an embodiment of an HFSAS questionnaire.

FIG. 3 is a table showing the correlation coefficient values of each method.

FIG. 4 illustrates the accuracies of the obtained values, which range between 83.2% and 98.5%.

FIG. 5 is a table that illustrates the experimental result, and shows that naive Bayes, Bayesian network, and VFI have recall values of up to 0.7 for weight, 0.714 for systolic blood pressure, 0.889 for diastolic blood pressure and 0.906 for heart rate values.

FIG. 6 is a table that illustrates the recall values.

DETAILED DESCRIPTION

Congestive heart failure (CHF) is a leading cause of death in the United States with approximately 670,000 individuals diagnosed every year. The sequelae of CHF are well known, with frequent decompensation of the chronic state resulting in recurrent hospitalizations. Experts believe that constant monitoring of patients with CHF is important to the health of such patients. Remote patient monitoring is a promising solution for an expanding population of CHF patients who are unable to access clinics due to insufficient resources, inconvenient location, or advanced infirmity. Medical care facilitated by remote technology has the potential to enable early detection of key clinical symptoms indicative of CHF-related decompensation. Such remote technologies can also enable health professionals to offer surveillance, advice, and continuity of care to trigger early implementation of strategies that enhance adherence behaviors.

The WANDA (Weight and Activity) project is one example of a wireless health project that leverages sensor technologies and remote communication to monitor the health status of patients with CHF. WANDA monitors health-related measurements and other information deemed relevant to CHF assessment, including weight, blood pressure, heart rate, activity, and daily somatic awareness scale questionnaires. Detailed descriptions of the WANDA system and its use for monitoring CHF patients can be found in Suh, M. et al., “WANDA B.: Weight and activity with blood pressure monitoring system for heart failure patients,” in 2010 IEEE International Symposium on A World of Wireless, Mobile and Multimedia Networks (WoWMoM), 2010, pp. 1-6; Suh, M. et al., “An automated vital sign monitoring system for congestive heart failure patients,” Proceedings of the 1st ACM International Health Informatics Symposium, 2010; and Suh, M. et al., “A remote patient monitoring system for congestive heart failure,” Journal of Medical Systems, 2011, all of which are incorporated herein by reference in their entirety for all purposes.

It is desired for a remote monitoring system such as WANDA to collect and store all monitored vital signs. Any unhealthy changes in a patient’s vital signs should be addressed promptly in order to prevent further degradation of a patient’s health. Unfortunately, the first randomized trial of WANDA experienced a considerable amount of missing data. Only 33% of the somatic questionnaires were completed, and 55.7% of data had missing values for weight, blood pressure, and heart rate. Moreover, 22.2% of patients experienced system misuse and requested help to accustom themselves to WANDA’s technologies. Missing data was further caused by system nonuse and service disorder (such as a network failure, resulting in as much as 6.3% of all of the missing data).

Notably, other studies have experienced similar data loss. Missing data is especially common in randomized controlled trials. Wood’s study showed that 89% of 71 trials published in 2001 in well-known journals (the British Medical Journal, the Journal of the American Medical Association, the Lancet, and the New England Journal of Medicine) reported having partly missing outcome values. Many studies applied last observation carried forward, worst case imputation, and complete case analysis techniques. However, such techniques may lead to biased results. To date, there has been no study on missing data imputation in CHF randomized trials.

One objective of embodiments of the present disclosure is to enhance the accuracy of CHF missing data imputation using data mining techniques. Data imputation may allow a patient monitoring system to detect an unhealthy change in patient vital signs even when portions of that data are not collected by the system. Embodiments of the present disclosure exploit the projection adjustment by contribution estimation (PACE) regression method for predicting and imputing non-binomial data such as questionnaire responses. Bayesian methods and voting feature interval (VFI) are used to impute binomial data. Results of these methods may be compared using accuracy and correlation coefficient values for non-binomial cases, and recall values for binomial cases. These methods may be compared with several other popular data mining methods. The experimental results show that PACE regression, Bayesian methods, and voting feature interval are superior to other methods for CHF patient data imputation.

FIG. 1 illustrates a block diagram of a system 100 for collecting and imputing patient health data. Patient data is collected from a patient 90 by at least one data collection device 102. As described above with respect to WANDA, the at least one data collection device may include a scale, a heart rate monitor, a blood pressure monitor, a motion-sensing activity monitor, and/or a computing device configured to collect questionnaire answers. In one embodiment, the data collection device 102 may be a separate device that collects data values from such devices at the location of the patient 90.

The data collection device 102 transmits the data to a patient data computing device 104, where the patient data is stored in a raw data store 106. In one embodiment, the data collection device 102 transmits the data to the patient data computing device 104 over a network such as a public switched telephone network; a wide area network; a local area network; the Internet; a wireless network such as 3G, 4G, LTE, GSM, Bluetooth, WiFi, WiMax; and/or via any other suitable networking technology. In another embodiment, the data collection device 102 may be transported to the location of the patient data computing device 104, and may transmit the data to the patient data computing device 104 via a direct data connection between the devices, such as a USB connection, a Firewire connection, and/or the like.

A prediction engine 108 may then impute missing patient data values as discussed further below, and may store the imputed patient data values in a predicted data store 110. In some embodiments, the prediction engine 108 may search for missing values, and then perform the calculations described below to predict the missing values. If the predicted values are beyond threshold limits, such as a threshold limit specified by a caregiver, the patient data computing device 104 may generate an alert to be presented to the caregiver. The alert may include one or more predicted or measured values, which may then prompt the caregiver to check the status of the patient or to ask the patient to verify the predicted values. In cases where the predicted values do not match the actual status of the patient, the prediction engine 108 may use the actual status as training data for a subsequent prediction.
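The flow described above (search for missing values, predict them, and compare the results against caregiver-specified limits) might be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the list-of-readings layout, the threshold values, and the stand-in predictor (the mean of the observed values) are all hypothetical and are not taken from the disclosure, which uses the prediction methods described further below.

```python
from typing import Callable, Optional


def impute_and_check(
    readings: list[Optional[float]],
    predict: Callable[[list[float]], float],
    low: float,
    high: float,
) -> tuple[list[float], list[tuple[int, float, bool]]]:
    """Fill missing readings (None) and collect alerts for out-of-range values.

    Each alert is (index, value, was_predicted), so a caregiver can tell
    measured values from imputed ones.
    """
    observed = [r for r in readings if r is not None]
    filled: list[float] = []
    alerts: list[tuple[int, float, bool]] = []
    for i, r in enumerate(readings):
        was_predicted = r is None
        value = predict(observed) if was_predicted else r
        filled.append(value)
        if value < low or value > high:
            alerts.append((i, value, was_predicted))
    return filled, alerts


def mean_predictor(observed: list[float]) -> float:
    # Hypothetical stand-in for the prediction engine; not PACE, Bayes, or VFI.
    return sum(observed) / len(observed)


# Hypothetical daily heart rate readings with one missing day.
heart_rates = [72.0, None, 118.0, 70.0]
filled, alerts = impute_and_check(heart_rates, mean_predictor, low=50.0, high=100.0)
```

Returning the `was_predicted` flag with each alert reflects the statement that an alert may include predicted or measured values, which lets the caregiver ask the patient to verify an imputed reading.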

In some embodiments, the prediction engine 108 may include one or more computer-executable components stored on a computer-readable medium that, if executed by a processor of a computing device, cause the computing device to perform the actions described below. In some embodiments, the prediction engine 108 may include one or more computing devices specially configured to perform the described actions. In some embodiments, the raw data store 106 and the predicted data store 110 may be databases managed by a conventional relational database management system (RDBMS). One of ordinary skill in the art will recognize that the raw data store 106 and the predicted data store 110 may be separate databases, or may be stored in a single database. In other embodiments, the raw data store 106 and/or the predicted data store 110 may use any other suitable storage method, such as a structured query language (SQL) file, a spreadsheet, a text document, and/or the like.

In some embodiments, the patient data computing device 104 may include at least one processor, an interface for coupling the computing device to the data collection device 102, and a nontransitory computer-readable medium. The computer-readable medium may have computer-executable instructions stored thereon that, in response to execution by the processor, cause the patient data computing device 104 to perform the calculations described further below. One example of a suitable computing device is a personal computer specifically programmed to perform the actions described herein. This example should not be taken as limiting, as any suitable computing device, such as a laptop computer, a smartphone, a tablet computer, a cloud computing platform, an embedded device, and/or the like, may be used in various embodiments of the present disclosure. One of ordinary skill in the art will recognize that the components illustrated as part of the patient data computing device 104 may be combined into a single component, or may each be split apart into multiple components. Further, the patient data computing device 104 may be a single computing device that stores and/or executes each of the illustrated components, or may include multiple computing devices communicatively coupled to each other that each store and/or execute part or all of the illustrated components.

Non-Binomial Case Imputation

In one embodiment, WANDA may employ the Heart Failure Somatic Awareness Scale (HFSAS), which is a 12-item Likert-type scale to measure awareness of signs and symptoms specific to CHF. A 4-point Likert-type scale is used to ascertain how much a patient is bothered by a symptom (0: not at all, 1: a little, 2: a great deal, 3: extremely). FIG. 2 illustrates one example of an embodiment of an HFSAS questionnaire. In order to predict missing answers to such a questionnaire, embodiments of the present disclosure may use the projection adjustment by contribution estimation regression algorithm (PACE) (rounding any non-integer value returned by PACE). This method is based on maximum likelihood estimation (MLE) and an empirical Bayes framework to minimize the Kullback-Leibler (KL) distance between the original and the estimation function.
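The rounding step mentioned above can be illustrated with a small helper that maps a real-valued regression output onto the 0 to 3 response scale. This is only a sketch: the function name is hypothetical, the half-up rounding rule is an assumption (the disclosure says only that non-integer values are rounded), and clamping to the valid range is an added assumption for outputs that fall outside the scale.

```python
def to_likert(value: float, lowest: int = 0, highest: int = 3) -> int:
    """Round a regression output to the nearest valid response on the scale."""
    # Round half away from zero; Python's built-in round() rounds halves to even.
    rounded = int(value + 0.5) if value >= 0 else -int(-value + 0.5)
    # Assumption: out-of-range predictions are clamped to the scale limits.
    return max(lowest, min(highest, rounded))


# Hypothetical regression outputs for four missing answers.
imputed = [to_likert(v) for v in (1.4, 2.5, 3.7, -0.2)]
```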

First, the PACE algorithm transforms parameters using MLE’s asymptotic normality property to convert the original parameters. The algorithm utilizes the empirical Bayes estimator in (1):

$\theta_{i}^{m} = \frac{\int{\theta f\left( x_{i} \middle| \theta \right)dG_{k}(\theta)}}{\int{f\left( x_{i} \middle| \theta \right)dG_{k}(\theta)}}$

where $\theta_{i}^{m}$ is the estimator, f(x_(i) | θ) is a probability density function (PDF), and G_(k) is a consistent estimator of G, which is the mixing distribution of the mixture f_(G) = ∫f(x|θ)dG. The developed algorithm then minimizes the KL distance between f and $\widetilde{f}$, given in (2):

$\Delta_{KL}\left( {f,\widetilde{f}} \right) = E_{f}\log\left( \frac{f}{\widetilde{f}} \right) = {\int{\log\left( \frac{f}{\widetilde{f}} \right)}}f\mspace{6mu} dx$

This method may show better results in high dimensional data spaces, and was applied to complete cases that have all 12 answered questions to evaluate the accuracy.
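To make the KL distance in (2) concrete, the following sketch evaluates its discrete analogue, a sum of f log(f / f̃) over a finite set of outcomes, in place of the integral. It is not the PACE algorithm itself; the function name and the example distributions are hypothetical, and the code assumes that every entry of the approximating distribution is strictly positive.

```python
import math


def kl_distance(f: list[float], f_tilde: list[float]) -> float:
    """Discrete analogue of (2): sum over outcomes of f * log(f / f_tilde)."""
    # Terms with f == 0 contribute nothing; f_tilde is assumed positive.
    return sum(p * math.log(p / q) for p, q in zip(f, f_tilde) if p > 0)


# Hypothetical distributions over three outcomes.
f = [0.5, 0.3, 0.2]
close_estimate = [0.45, 0.35, 0.2]
poor_estimate = [0.1, 0.1, 0.8]

d_close = kl_distance(f, close_estimate)
d_poor = kl_distance(f, poor_estimate)
```

An estimate that is closer to f yields a smaller distance, and the distance is zero when the two distributions coincide, which is why minimizing it is a reasonable fitting criterion.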

Binomial Case Imputation

A binomial approach may be used to predict alarms normally triggered by abnormal data values (e.g., drastic weight changes, unhealthy blood pressure, etc.) given missing data. For example, the system may be configured to trigger an alarm if a patient has an extreme change in weight, even when the extreme weight value is missing from the data collected by WANDA. Embodiments of the present disclosure may use naive Bayes, a Bayesian network, and VFI to detect such changes in order to alert caregivers. Naive Bayes and Bayesian network classifiers are algorithms that approach the classification problem using the conditional probabilities of the features. A Bayesian network is a directed acyclic graph (DAG) over a set of variables X, where the outgoing edges of a variable x_(i) specify all variables that depend on x_(i). The probability of an outcome is determined as:

$P(X) = \prod_{x \in X}{p\left( x \middle| \text{par}(x) \right)}$

where X = {x_(1), x_(2), ... , x_(k)} is a set of variables, and par(x) is the set of parents of x in a Bayesian network. The probability of the instance belonging to a single class may be calculated by using the prior probabilities of classes and the feature values for an instance. The naive Bayesian method assumes that features are independent and there are no hidden or latent attributes in the prediction process. As such, the experimental results for naive Bayes and Bayesian network can be slightly different, as

$\text{p}\left( \text{class} \right) = \frac{1 + \text{N}\left( \text{class} \right)}{\text{N}\left( \text{class} \right) + \text{N}\left( \text{instances} \right)}$

for naive Bayes and

$\text{p}\left( \text{class} \right) = \frac{\frac{1}{2} + \text{N}\left( \text{class} \right)}{\text{N}\left( \text{class} \right) \times \frac{1}{2} + \text{N}\left( \text{instances} \right)}$

for Bayesian network, where N(x) is the number of sets or instances.
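The factorization of P(X) as a product of p(x | par(x)) terms, given above, can be illustrated with a two-variable network. The sketch below is an assumption-laden example only: the variable names, the network structure (weight_gain as the parent of alarm), and every probability value are hypothetical and do not come from the WANDA data.

```python
# Hypothetical network: weight_gain -> alarm. All probabilities are invented.
parents = {"weight_gain": (), "alarm": ("weight_gain",)}

# Conditional probability tables, keyed by the tuple of parent values.
cpt = {
    "weight_gain": {(): {True: 0.2, False: 0.8}},
    "alarm": {
        (True,): {True: 0.9, False: 0.1},
        (False,): {True: 0.05, False: 0.95},
    },
}


def joint_probability(assignment: dict[str, bool]) -> float:
    """Multiply p(x | par(x)) over every variable x in the network."""
    prob = 1.0
    for var, pars in parents.items():
        parent_values = tuple(assignment[p] for p in pars)
        prob *= cpt[var][parent_values][assignment[var]]
    return prob


p_both = joint_probability({"weight_gain": True, "alarm": True})
```

Summing `joint_probability` over all four assignments gives 1, which is a quick sanity check that the tables define a valid joint distribution.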

VFI is a categorical classification algorithm and, like the Bayes methods, considers each feature independently. The classification of a new instance may be based on a vote among the classifications built by the value of each feature. While training, the VFI algorithm constructs intervals for each feature. For the classification, a single value and the votes of each class in that interval are calculated for each interval. For each class c, feature f gives a vote value:

$\text{feature\_vote}\left\lbrack \text{f,c} \right\rbrack = \frac{\text{interval\_class\_count}\left\lbrack \text{f,i,c} \right\rbrack}{\text{class\_count}\left\lbrack \text{c} \right\rbrack}$

where interval_class_count[f,i,c] is the number of instances of class c that are members of interval i of feature f. The class with the highest total vote is predicted to be the class of the test instance.

In the Bayes methods, each feature participates in the classification by assigning a probability to each class, and the final probability of a class is the product of the probabilities measured on each feature. In VFI, each feature distributes its vote among classes, and the final vote of a class is the sum of the votes given by the features.
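The voting scheme described above can be sketched in a few lines. This is a simplified illustration, not a full VFI implementation: the interval boundaries are supplied by hand rather than constructed during training, each feature's votes are normalized to sum to one before being added (one reading of "distributes its vote among classes"), and the function names, feature meanings, and data values are hypothetical.

```python
from collections import Counter


def interval_of(value: float, boundaries: list[float]) -> int:
    """Index of the interval that contains value, given sorted cut points."""
    return sum(value >= b for b in boundaries)


def train_vfi(rows, labels, edges):
    """Count training instances per class and per (feature, interval, class)."""
    class_count = Counter(labels)
    interval_class_count = Counter()
    for row, label in zip(rows, labels):
        for f, value in enumerate(row):
            interval_class_count[(f, interval_of(value, edges[f]), label)] += 1
    return class_count, interval_class_count


def classify_vfi(row, class_count, interval_class_count, edges):
    """Sum each feature's normalized votes and return the top-voted class."""
    totals = {c: 0.0 for c in class_count}
    for f, value in enumerate(row):
        i = interval_of(value, edges[f])
        # feature_vote[f, c] = interval_class_count[f, i, c] / class_count[c]
        votes = {c: interval_class_count[(f, i, c)] / n for c, n in class_count.items()}
        norm = sum(votes.values())
        if norm:
            for c, v in votes.items():
                totals[c] += v / norm
    return max(totals, key=totals.get)


# Hypothetical instances: (daily weight change in kg, heart rate in bpm).
rows = [(0.2, 70), (0.5, 75), (2.5, 110), (3.0, 105), (0.1, 72)]
labels = ["normal", "normal", "alarm", "alarm", "normal"]
edges = [[1.5], [100]]  # one hand-picked cut point per feature

class_count, interval_class_count = train_vfi(rows, labels, edges)
prediction = classify_vfi((2.8, 108), class_count, interval_class_count, edges)
```

Because the votes are summed rather than multiplied, a single feature with a zero count for a class does not eliminate that class, which is the main contrast with the product rule used by the Bayes methods.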

Subjects and Datasets

The WANDA system was used for health data collection on 26 different subjects. The population of the participants was approximately 68% male; 40% White, 13% Black, 32% Latino, and 15% Asian/Pacific Islander; with a mean age of approximately 68.7 ± 12.1. Study participants were all provided with Bluetooth weight scales, blood pressure monitors, land line gateways, and personal activity monitor devices. Each captured data instance for the study comprises 37 different attributes including, but not limited to: timestamps; weight; diastolic/systolic blood pressure; heart rate; metabolic equivalents (METs); calorie expenditure; and numeric responses to twelve somatic awareness questions. Each data instance was gathered from each subject once a day. One thousand and ninety instances were gathered.

The study used the missing at random (MAR) hypothesis. MAR assumes that missing data is dependent on observed data. Hence, missing data can be predicted by resident data. All 1090 instances of data are complete (i.e., contain all 37 data values). Instances were divided into two groups: training and testing. Values from the testing set predicted by the data imputation techniques were compared to their actual values to evaluate the effectiveness of each system.

Example Results

For non-binomial data, PACE, linear, simple linear and isotonic regression methods were applied. FIG. 3 is a table showing the correlation coefficient values of each method. The correlation coefficient is a measure of least-squares fitting to the original data. For N given data points (X,Y), the correlation coefficient ρ_(X,Y) is given as equation (5), where COV(X,Y) is the covariance between X and Y, and σ_(X), σ_(Y) are the standard deviation values of X and Y. The experimental results show that the PACE regression method works better on average than the other regression methods.

$\rho_{X,Y} = \frac{COV\left( {X,Y} \right)}{\sigma_{X} \times \sigma_{Y}}$
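As an illustration of equation (5), the sketch below computes the correlation coefficient between a set of actual questionnaire answers and a set of imputed answers. The function name and the sample values are hypothetical; the code uses population (divide-by-N) covariance and standard deviations, and it assumes that neither input is constant, since a zero standard deviation would make the ratio undefined.

```python
import math


def correlation(xs: list[float], ys: list[float]) -> float:
    """Equation (5): COV(X, Y) divided by the product of the standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    sigma_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sigma_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return cov / (sigma_x * sigma_y)


# Hypothetical actual and imputed answers on the 0 to 3 scale.
actual = [0, 1, 2, 3, 1]
imputed = [0, 1, 2, 2, 1]
rho = correlation(actual, imputed)
```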

After calculating the coefficient and constant variables, the developed algorithm determines missing values using PACE regression (rounding any non-integer value returned by PACE). The accuracies of the obtained values range between 83.2% and 98.5%, as shown in FIG. 4.

The binomial case predicts a potential abnormal vital sign when missing data exist within WANDA’s database. C4.5, random tree, naive Bayes, Bayesian network, VFI, nearest neighbor, PART, DTNB, decision table, and rotation table algorithms were applied and their recall values were compared. For each method, ten-fold cross validation was applied. In ten-fold validation, the original sample is randomly partitioned into ten subsets and a single subset is held out as the testing set, with the remaining nine subsets used as training data. This cross-validation process is then repeated ten times, using a new subset as the testing set for each repetition. Recall values are given as:

$\text{recall} = \frac{T_{p}}{T_{p} + F_{n}}$

where T_(p) is the number of true positives and F_(n) is the number of false negatives. FIG. 5 is a table that illustrates the experimental result, and shows that naive Bayes, Bayesian network, and VFI have recall values of up to 0.7 for weight, 0.714 for systolic blood pressure, 0.889 for diastolic blood pressure and 0.906 for heart rate values.
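The ten-fold partitioning and the recall formula described above might be sketched as follows. This is an illustrative outline only: the function names, the fixed random seed, and the example labels are assumptions, and the recall helper assumes at least one actual positive so that the denominator is nonzero.

```python
import random


def ten_fold_indices(n_samples: int, k: int = 10, seed: int = 0):
    """Yield (train, test) index lists; every sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # random partition of the sample
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test


def recall(actual: list[bool], predicted: list[bool]) -> float:
    """True positives divided by true positives plus false negatives."""
    tp = sum(a and p for a, p in zip(actual, predicted))
    fn = sum(a and not p for a, p in zip(actual, predicted))
    return tp / (tp + fn)


# 1090 matches the number of instances reported for the study.
splits = list(ten_fold_indices(1090))

# Hypothetical alarm labels: three actual alarms, two of them detected.
example_recall = recall([True, True, True, False], [True, False, True, True])
```

Recall ignores false alarms (the fourth instance in the example), which suits a setting where a missed abnormal vital sign is the more costly error.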

Classifiers were trained in two ways. First, unique classifiers were created for each individual, where only data collected from an individual was used to predict values from the same individual. Second, a grouped classifier was created using data from the entire population. Both the individual and grouped classifiers were compared using ten-fold validation to test data from 16 patients. The recall values of weight, blood pressure, and heart rate are improved when training on the entire group’s data as compared with training on each individual’s data separately. FIG. 6 is a table that illustrates the recall values. For questionnaire data, the accuracies of results were also better when training on all patients’ data. When training individually, 75% of patients’ data showed 0% accuracy. This is because the entire group has a larger amount of data and many individuals share similarities in monitored attributes, such as age, symptoms of CHF, etc.

The accuracy of the CHF missing data was enhanced using the PACE regression method for predicting and imputing non-binomial data; and Bayesian methods and voting feature interval for binomial data. The experimental results show that PACE regression works better than linear regression, simple linear regression, and isotonic regression methods with accuracy values of more than 83.2%. The experiment comparing Bayes and VFI methods with other algorithms proves that Bayes and VFI algorithms work better (FIG. 5) with recall values of up to 0.7 for weight, 0.714 for systolic blood pressure, 0.889 for diastolic blood pressure and 0.906 for heart rate values. This study also showed that increased accuracy is obtained by training on a large population as opposed to training the classifiers for each individual independently.

While a preferred embodiment of the invention has been illustrated and described, it will be appreciated that various changes can be made therein without departing from the spirit and scope of the invention.

What is claimed is:
 1. A system configured to impute missing patient data for health care monitoring.